Main Content
Preparing the Data
Firstly, create a new project in R (always use projects when working
in R!), create a new folder to store your data in, and download the
Spaceship Titanic data from Kaggle into that new directory. The data is
already split into two files, train.csv and
test.csv. One important difference is that
test.csv doesn’t contain any ground truth labels,
so we won’t know whether our predictions on it are correct or not.
We can start by loading in the data
set.seed(123)
library(tidyverse)
raw_data <- read_csv("./Data/00_Raw_Data/train.csv")
head(raw_data)
We can use the skimr package to get a summary of all the
variables in the data set.
library(skimr)
skim(raw_data)
| Data summary | |
|---|---|
| Name | raw_data |
| Number of rows | 8693 |
| Number of columns | 14 |
| Column type frequency: | |
| character | 5 |
| logical | 3 |
| numeric | 6 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| PassengerId | 0 | 1.00 | 7 | 7 | 0 | 8693 | 0 |
| HomePlanet | 201 | 0.98 | 4 | 6 | 0 | 3 | 0 |
| Cabin | 199 | 0.98 | 5 | 8 | 0 | 6560 | 0 |
| Destination | 182 | 0.98 | 11 | 13 | 0 | 3 | 0 |
| Name | 200 | 0.98 | 7 | 18 | 0 | 8473 | 0 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| CryoSleep | 217 | 0.98 | 0.36 | FAL: 5439, TRU: 3037 |
| VIP | 203 | 0.98 | 0.02 | FAL: 8291, TRU: 199 |
| Transported | 0 | 1.00 | 0.50 | TRU: 4378, FAL: 4315 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Age | 179 | 0.98 | 28.83 | 14.49 | 0 | 19 | 27 | 38 | 79 | ▂▇▅▂▁ |
| RoomService | 181 | 0.98 | 224.69 | 666.72 | 0 | 0 | 0 | 47 | 14327 | ▇▁▁▁▁ |
| FoodCourt | 183 | 0.98 | 458.08 | 1611.49 | 0 | 0 | 0 | 76 | 29813 | ▇▁▁▁▁ |
| ShoppingMall | 208 | 0.98 | 173.73 | 604.70 | 0 | 0 | 0 | 27 | 23492 | ▇▁▁▁▁ |
| Spa | 183 | 0.98 | 311.14 | 1136.71 | 0 | 0 | 0 | 59 | 22408 | ▇▁▁▁▁ |
| VRDeck | 188 | 0.98 | 304.85 | 1145.72 | 0 | 0 | 0 | 46 | 24133 | ▇▁▁▁▁ |
In this data set we are trying to predict the variable
Transported. It’s important to check whether there is a class
imbalance, i.e. to check that there are roughly the same number of people
transported and not transported.
raw_data %>%
  group_by(Transported) %>%
  summarise(
    Count = n()
  )
Now we can start transforming our data ready for machine
learning. There are two ways of doing this: manually transforming it, or
creating a data preprocessing object to attach to the model. A data
preprocessing object takes the data that you pass into the
model and automatically transforms it into the correct form for the model.
This is useful because you only need to write the preprocessing
object once and it will automatically transform any new data you give
your model. You can also deploy your model with this preprocessing
object attached, meaning other people don’t have to preprocess their data
into the same form before using it, making the model easier to share. The
downside is that these preprocessing objects can be hard to code;
tidymodels makes it much easier, but it can still be a bit
fiddly. In this walkthrough we use a combination of both methods.
The first variable that we are going to tackle is Cabin.
This is made up of the cabin’s deck, number, and side of ship, all
separated by forward slashes. We can convert this one column into three
using the separate function as shown below.
data_proc <- raw_data %>%
  separate(
    Cabin, into = c("Deck", "Num", "Side"), sep = "/"
  )
data_proc
We can then take all of the non-numeric variables and convert them
into the correct form to be read by the model. Firstly, just for ease of
interpretability, I take the logical variable Transported and
recode it as "Transported" and
"Not Transported" rather than TRUE and
FALSE. I then convert all of the categorical variables
(characters and logicals) into factors (I also convert Num
into a numeric variable; it wasn’t converted before as it contains
NA values). We don’t convert the PassengerId
and Name variables as we aren’t going to use them as
part of our model.
data_proc <- data_proc %>%
  mutate(
    Transported = ifelse(Transported, "Transported", "Not Transported"),
    Transported = factor(Transported, levels = c("Transported", "Not Transported")),
    VIP = factor(VIP),
    CryoSleep = factor(CryoSleep),
    HomePlanet = factor(HomePlanet),
    Deck = factor(Deck),
    Num = as.numeric(Num),
    Side = factor(Side),
    Destination = factor(Destination)
  )
data_proc
You should spend a very long time performing exploratory data analysis, getting to know your data and thinking about how it will interact with your model. We don’t have time to do that here, so let’s just take a basic look at the numeric variables. Below I select the numeric variables and plot each of their distributions.
data_proc %>%
  select_if(function(col) is.numeric(col)) %>%
  pivot_longer(
    cols = 1:ncol(.),
    names_to = "variable",
    values_to = "value"
  ) %>%
  ggplot() +
  aes(
    x = value
  ) +
  geom_histogram() +
  facet_wrap(
    ~ variable,
    scales = "free"
  ) +
  scale_y_continuous(
    trans = scales::log1p_trans()
  ) +
  theme_bw()
I then do the same thing but log1p transform all of the
numeric variables. The log1p transform is like the normal
log function except that one is added to the number before
logging it. This avoids trying to take the log of 0, which is
undefined (R returns -Inf).
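As a quick aside, the behaviour at zero is easy to check in the console (base R only, nothing from the walkthrough needed):

```r
# log() of 0 is -Inf, which would break the histograms above;
# log1p(x) computes log(1 + x), so 0 maps cleanly to 0
log(0)    # -Inf
log1p(0)  # 0
log1p(9)  # same as log(10), approximately 2.302585
```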
data_proc %>%
  select_if(function(col) is.numeric(col)) %>%
  pivot_longer(
    cols = 1:ncol(.),
    names_to = "variable",
    values_to = "value"
  ) %>%
  ggplot() +
  aes(
    x = value
  ) +
  geom_histogram() +
  facet_wrap(
    ~ variable,
    scales = "free"
  ) +
  scale_y_continuous(
    trans = scales::log1p_trans()
  ) +
  scale_x_continuous(
    trans = scales::log1p_trans()
  ) +
  theme_bw()
You can see that Age and Num are more
evenly balanced in the original data, but the rest are more evenly
balanced in the logged data. Therefore, all numeric variables are logged
except Age and Num.
data_proc <- data_proc %>%
  mutate(
    RoomService = log1p(RoomService),
    FoodCourt = log1p(FoodCourt),
    ShoppingMall = log1p(ShoppingMall),
    Spa = log1p(Spa),
    VRDeck = log1p(VRDeck)
  )
Finally, another consideration is whether any variables
are correlated with each other. We can check this using the
ggpairs function from the GGally package.
library(GGally)
data_proc %>%
  select_if(function(col) is.numeric(col)) %>%
  ggpairs() +
  theme_bw()
Here we are looking for pairs of predictors that are strongly
correlated; when two variables are highly correlated, one can be removed,
simplifying the model whilst losing little predictive power. Below we
drop FoodCourt, along with the other variables that are not
going to be used by the model.
data_proc <- data_proc %>%
  select(
    -FoodCourt,
    -PassengerId,
    -Name
  )
We have loaded our training set, and we also have a testing set in
test.csv. However, our testing data set doesn’t have any
ground truth values for us to estimate how well our model will perform
on unseen data. Because of this, we can create a validation set and use it
like a test set to estimate our model’s performance. You would use this
same code (just naming your variables differently) to do a typical
train/test split. We use the strata = Transported argument
to make the proportion of people transported the same in the training
and validation sets.
library(tidymodels)
data_split <- initial_split(data_proc, strata = Transported)
data_train <- training(data_split)
data_val <- testing(data_split)
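If you want to see what strata is doing, here is a small self-contained sketch (the toy data and its 30/70 class split are invented purely for illustration) showing that the class balance survives the split:

```r
library(tidymodels)

set.seed(42)
# Toy outcome that is deliberately imbalanced: 30% "yes", 70% "no"
toy <- tibble(
  x = rnorm(100),
  y = factor(rep(c("yes", "no"), times = c(30, 70)))
)

toy_split <- initial_split(toy, strata = y)

# Both partitions should keep roughly the 30/70 proportions
training(toy_split) %>% count(y) %>% mutate(prop = n / sum(n))
testing(toy_split) %>% count(y) %>% mutate(prop = n / sum(n))
```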
Create Data Preprocessor
We can also create a data preprocessor that can be prepended to the
model to automatically preprocess the data. tidymodels uses
a package called recipes to do this.
To start, we create a recipe. We have to give it a data set; in this
case we have given it data_train. It is very
important that you only give the recipe the training data and not the
testing data; this is to prevent something called data leakage.
We also have to give the recipe a formula, which tells it what variable
we are trying to predict and what variables we are predicting it with.
Here we have specified Transported ~ ., which means we are
trying to predict Transported using all of the other
variables in the data set.
After defining the recipe, we then add steps to tell the preprocessor
what operations to perform. The first step I have added is
step_nzv. This is a pretty standard first step which looks
through all of your data and removes any variables that have
near-zero variance.
These are variables that change very little, if at all, and
therefore are unlikely to contain much information. As well as saying
what step should be performed, we also need to say which variables the
step should be performed on; in this case I used the
all_predictors function, which will automatically apply it
to every variable we are using to make the prediction.
I then decided to center and scale all of my variables using
step_center and step_scale, using the selector
all_numeric_predictors to only apply these steps to variables
that are numbers. If you have a predictor where all of the values are
the same number, then step_scale will fail as this variable
has zero variance; this is another example of why step_nzv
can be a good idea.
An important part of machine learning is data imputation, the
process of dealing with missing data. There are much better ways to do
it than this, and the right approach will be different for every data
set, but here is an example of one method you can perform inside a recipe.
step_impute_knn will find any point with missing data and
will then, by default, find the 5 other data points that look most like
it. It will then look at these 5 closest data points, specifically at
the value that is missing, calculate the average, and replace the
missing value with this average.
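To make that concrete, here is a toy sketch (the data is invented purely for illustration) of step_impute_knn filling in a single missing value:

```r
library(tidymodels)

# One missing value in `a`; `b` is complete and is used to find neighbours
toy <- tibble(
  a = c(1, 2, 3, NA, 5),
  b = c(2, 4, 6, 8, 10)
)

imputed <- recipe(~ ., data = toy) %>%
  step_impute_knn(all_predictors(), neighbors = 2) %>%
  prep() %>%
  juice()

# The NA in `a` is now replaced by an average over its 2 nearest neighbours
imputed
```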
The final step in the recipe I have made is step_dummy,
which turns categorical variables into numeric ones. It
does this by converting one factor variable into several variables, with
each level (except one) represented in its own column. If the
observation belongs to the category given by a level then the
corresponding column will be 1, and it will be 0 otherwise (one
level is left out; if the observation belongs to that category then all of
the columns will be 0). This step is applied to all of the factors
being used for prediction using the all_factor_predictors
selector.
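The same encoding can be seen outside of recipes with base R’s model.matrix; this toy factor (invented for illustration) has three levels, so it becomes two 0/1 columns, with the first level as the all-zeros reference:

```r
# Toy dummy coding: level "A" is the reference and gets no column of its own
toy <- data.frame(Deck = factor(c("A", "B", "C", "B")))
model.matrix(~ Deck, toy)
#   (Intercept) DeckB DeckC
# 1           1     0     0
# 2           1     1     0
# 3           1     0     1
# 4           1     1     0
```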
I then use the prep and juice functions to
print out an example of the transformed data. prep will fit
the preprocessor to your training data set and juice will
extract the transformed training data (you can transform new data using
the bake function). After the model is fit, each new
observation will be processed the same way as the training data, i.e. the
new observations will be centered and scaled with the mean and variance
of the training data, not of the new data. Although I use them to show the
data here, do not create a variable storing the recipe after
prep has been called, otherwise what we do later won’t
work.
data_rec <- recipe(Transported ~ ., data = data_train) %>%
  step_nzv(all_predictors()) %>%
  step_center(all_numeric_predictors()) %>%
  step_scale(all_numeric_predictors()) %>%
  step_impute_knn(all_predictors()) %>%
  step_dummy(all_factor_predictors())
data_rec %>% prep() %>% juice()
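The point about new data being transformed with the training statistics is worth seeing on a toy example (the data here is invented for illustration): the new rows are centered with the training mean, not their own.

```r
library(tidymodels)

toy_train <- tibble(x = c(1, 2, 3, 4), y = factor(c("a", "b", "a", "b")))
toy_new   <- tibble(x = c(5, 6),       y = factor(c("a", "b")))

# Center x using the mean learned from the training data (2.5)
prepped <- recipe(y ~ x, data = toy_train) %>%
  step_center(all_numeric_predictors()) %>%
  prep()

juice(prepped)                     # training x centered on its own mean
bake(prepped, new_data = toy_new)  # new x also shifted by 2.5: 2.5 and 3.5
```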
Train a Model
For this walkthrough I am going to train a very simple linear model.
There are many more complicated models that will perform a lot better on
this data set, and the great thing about tidymodels is that
you train those models in almost exactly the same way. However, you do
need to have some understanding of how those models work in order to get
all the settings right, which is why I’m keeping it simple. As this is a
classification problem, I am going to use logistic regression, which
is a form of linear model that can differentiate between two classes.
There is also an extension of linear regression called multinomial
regression that can handle any number of classes, but we will stick with
logistic regression here.
Firstly, we aren’t going to just be creating a model, we are going to
be creating a workflow. A workflow is a series of processes that will be
applied to your data, one of which is a model. Our workflow is going to
have two stages, the data preprocessor we have already created and our
logistic regression model. So we create the workflow with the
workflow function and then add the preprocessor (or recipe)
and the model. It is important to remember that when we created our
recipe we said what variable we were going to predict and what variables
we were going to use to predict. So when we add the recipe that
information is made available to the workflow.
titanic_workflow <- workflow() %>%
  add_recipe(data_rec) %>%
  add_model(logistic_reg())
We can then fit the model using the fit function.
titanic_fit <- fit(
  titanic_workflow,
  data_train
)
We can now use this model to start predicting outcomes. Here we predict outcomes for the validation set and compare them against the ground truth values using a confusion matrix from the caret package.
preds_val <- predict(titanic_fit, data_val)
library(caret)
confusionMatrix(preds_val$.pred_class, data_val$Transported)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Transported Not Transported
## Transported 877 253
## Not Transported 218 826
##
## Accuracy : 0.7833
## 95% CI : (0.7654, 0.8005)
## No Information Rate : 0.5037
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.5666
##
## Mcnemar's Test P-Value : 0.1172
##
## Sensitivity : 0.8009
## Specificity : 0.7655
## Pos Pred Value : 0.7761
## Neg Pred Value : 0.7912
## Prevalence : 0.5037
## Detection Rate : 0.4034
## Detection Prevalence : 0.5198
## Balanced Accuracy : 0.7832
##
## 'Positive' Class : Transported
##
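The headline accuracy above is just the proportion of correct predictions, which you can recompute from the four cells of the confusion matrix:

```r
# Correct predictions sit on the diagonal of the confusion matrix
(877 + 826) / (877 + 253 + 218 + 826)  # approximately 0.7833, as reported
```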
Test the Model
To create predictions for our test set, we can load it into R, perform the same transformations that we applied to our training data, and then pass it through our model.
data_test <- read_csv("./Data/00_Raw_Data/test.csv")
data_test_proc <- data_test %>%
  separate(
    Cabin, into = c("Deck", "Num", "Side"), sep = "/"
  ) %>%
  mutate(
    VIP = factor(VIP),
    CryoSleep = factor(CryoSleep),
    HomePlanet = factor(HomePlanet),
    Deck = factor(Deck),
    Num = as.numeric(Num),
    Side = factor(Side),
    Destination = factor(Destination)
  ) %>%
  mutate(
    RoomService = log1p(RoomService),
    FoodCourt = log1p(FoodCourt),
    ShoppingMall = log1p(ShoppingMall),
    Spa = log1p(Spa),
    VRDeck = log1p(VRDeck)
  ) %>%
  select(
    -FoodCourt,
    -PassengerId,
    -Name
  )
preds_test <- predict(titanic_fit, data_test_proc)
head(preds_test$.pred_class, 20)
## [1] Transported Not Transported Transported Transported
## [5] Transported Not Transported Transported Transported
## [9] Transported Transported Not Transported Not Transported
## [13] Not Transported Transported Not Transported Not Transported
## [17] Not Transported Transported Transported Not Transported
## Levels: Transported Not Transported
We can then transform this into a form that can be uploaded to Kaggle.
kaggle_submission <- tibble(
  PassengerId = data_test$PassengerId,
  Transported = preds_test$.pred_class
) %>%
  mutate(
    Transported = ifelse(Transported == "Transported", "True", "False")
  )
kaggle_submission
This can then be saved out to a csv file and uploaded to Kaggle to see how well we did.
write_csv(kaggle_submission, "kaggle_submission.csv")
CONGRATULATIONS!! You have now trained a basic machine learning model in R and used it to answer a Kaggle question.